add dcgm-exporter daemonset for nvidia #81
Conversation
version: v1
spec:
priorityClassName: system-node-critical
serviceAccountName: {{ template "cloudwatch-agent.serviceAccountName" . }}
Should we make a separate service account for the dcgm exporter instead of reusing the cloudwatch agent service account? The two might need different privileges.
Yeah, great call. Updated.
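For reference, a dedicated service account could be templated roughly like this. This is a minimal sketch only: the helper name `dcgm-exporter.serviceAccountName` and the values keys are assumptions, not necessarily what the chart ended up defining.

```yaml
# templates/dcgm-exporter-serviceaccount.yaml (hypothetical file name)
{{- if .Values.dcgmExporter.serviceAccount.create }}
apiVersion: v1
kind: ServiceAccount
metadata:
  # Helper name is an assumption; the chart may define its own naming helper.
  name: {{ template "dcgm-exporter.serviceAccountName" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: dcgm-exporter
{{- end }}
```

The daemonset's pod spec would then set `serviceAccountName` to that helper instead of the cloudwatch-agent one, so the two workloads can carry different RBAC bindings.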
@@ -35,10 +39,12 @@ containerLogs:
manager:
name:
image:
repository: cloudwatch-agent-operator
# repository: cloudwatch-agent-operator
probably need to revert to cloudwatch-agent-operator
tag: 1.0.2
repositoryDomainMap:
public: public.ecr.aws/cloudwatch-agent
# public: public.ecr.aws/cloudwatch-agent
probably need to revert this
tag: 1.300031.1b317
# repository: cloudwatch-agent
# tag: 1.300031.1b317
repository: tupperware
Are there changes to the agent that we need to make to get this to work?
Nope. Removed the test repo.
@@ -137,6 +146,9 @@ agent:
config: # optional config that can be provided to override the defaultConfig
defaultConfig:
{
"agent": {
"debug": true
remove
"metrics_collected": { | ||
"kubernetes": { | ||
"gpu_metrics": true |
Instead of having this as true, can we just make it an empty map, in case we want to add more config options in the future?
Removed. It was a leftover from a previous revision where there was another agent daemonset specifically for GPU nodes.
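For context, the empty-map shape suggested above would have looked roughly like this inside the values file; a sketch only, since the key was ultimately dropped from `defaultConfig` altogether.

```yaml
# Hypothetical fragment of defaultConfig in values.yaml (the key no longer exists).
defaultConfig:
  {
    "metrics_collected": {
      "kubernetes": {
        "gpu_metrics": {}
      }
    }
  }
```

An empty object keeps the key extensible, so future GPU-related options can be added without changing the key's type.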
name:
image:
repository: nvcr.io/nvidia/k8s/dcgm-exporter
I am assuming this will be our mirrored repo eventually
correct
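Assuming it does move to a mirror, that should be a values-only change; a hypothetical override might look like the following, where the registry path and the `dcgmExporter` key name are placeholders rather than the chart's real values.

```yaml
dcgmExporter:
  image:
    # Placeholder mirror registry; the chart currently points at nvcr.io/nvidia/k8s/dcgm-exporter.
    repository: <mirror-registry>/dcgm-exporter
    # Pin to a specific dcgm-exporter release when mirroring.
    tag: <pinned-version>
```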
Description of changes:
Added a `dcgm-exporter` daemonset to support NVIDIA GPU metrics with k8s. Traffic to the `dcgm-exporter` pods is limited to within the node (no cross-node communication). The new `dcgm-exporter` pod has a `nodeAffinity` config (see the sketch after this description) so it spins up only on GPU nodes. The supported GPU instances are hard-coded in `values.yaml`, which can be extended when more GPU instances/types are supported. Below is the list of pods in a test cluster with one GPU node and one non-GPU node.
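As referenced above, here is a rough sketch of the `nodeAffinity` selection on the daemonset's pod spec. The instance types shown are examples only; the authoritative, hard-coded list lives in `values.yaml`.

```yaml
# Hypothetical fragment of the dcgm-exporter daemonset pod spec.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                # Example GPU instance types; extend the list in values.yaml as needed.
                - p3.2xlarge
                - p4d.24xlarge
                - g4dn.xlarge
                - g5.xlarge
```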
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.